---
title: 'Analysis'
author: "Julia Christensen"
output: pdf_document
geometry: margin=.50in
fontsize: 10pt
keep_tex: yes
---

***********************************************
# 1) SET UP 

```{r setup, include=TRUE}
### Clear global environment
rm(list=ls()) 
gc()

### Libraries:
# install.packages("pacman") # Make sure fastlink_4.0 is installed 
library(pacman)
p_load(dplyr, tibble, # General functions
       magrittr, # General function for piping
       data.table, # Used to import vf
       fastLink, # Used in match functions
       here
)

### fastLink() dependencies 
p_load(Matrix,
    parallel,
    foreach,
    doParallel,
    gtools,
    data.table,
    stringdist,
    stringr,
    stringi,
    Rcpp,
    FactoClass,
    adagio,
    dplyr,
    plotrix,
    grDevices,
    graphics)

### File paths
here::i_am("Survey and App Match/110_run_matches.Rmd")
file_path <- here()
file_path_data_matched <- here("Survey and App Match/Temp_Data/")
vf_file_path <- paste0("D:/Data_VF_Unzipped/")

### Set path to match file 
fp110 <- paste0(file_path,"/Analysis/109_match_tfa_to_vf_template.R")

### Load the functions that run fastLink() and save its match output in parts
source(paste0(file_path,"/Functions/functions_save_match_output_by_parts_v3.R"))

### List of Match Variables
all_match_vars <- c("FirstName", "LastName", "Sex", 
                    "BirthYear",
                    "MailingAddressZip5","RegistrationAddressZip5")

# Columns to keep
cols_to_keep <- c("DT_ID",  "DT_RegID",  "StateVoterID",  "State",  "NamePrefix",
                  "FirstName",  "MiddleName",  "LastName",  "NameSuffix",  "Sex",
                  "BirthYear",  "BirthMonth",  "BirthDay",  "RegisteredParty",
                  "DTCalcParty",  "RegisteredParty_RollUp",  "SelfReportedDemographic",
                  "ModeledEthnicity",  "Race",  "CountyFIPS",  "PrecinctNumber",
                  "PrecinctName",  "RegistrationAddress1",  "RegistrationAddress2",
                  "RegistrationAddressZip5",  "RegistrationAddressLatitude",
                  "RegistrationAddressLongitude",  "MailingAddress1",
                  "MailingAddress2",  "MailingAddressZip5",  "LandLine_AreaCode",  
                  "LandLine_Number",  "CellPhone_AreaCode",  "CellPhone_Number",
                  "CellPhone_SourceCode",  "CellPhone_MatchLevel",
                  "CellPhone_ReliabilityCode",  "LastActiveDate",  "RegistrationDate",
                  "VoterStatus",  "PermanentAbsenteeFlag",  "VH16G",  "VH16P",  "VH16PP",
                  "VH15G",  "VH15P",  "VH14G",  "VH14P",  "VH13G",  "VH13P",  "VH12G",  "VH12P",
                  "VH12PP",  "VH11G",  "VH11P",  "VH10G",  "VH10P",  "VH09G",  "VH09P",  "VH08G",
                  "VH08P",  "VH08PP",  "VH07G",  "VH07P",  "VH06G",  "VH06P",  "VH05G",  "VH05P",
                  "VH04G",  "VH04P",  "VH04PP",  "VH03G",  "VH03P",  "VH02G",  "VH02P")

```
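
The match template presumably subsets the voter file to `cols_to_keep` before linking on `all_match_vars`. That is an assumption, since the template itself is not shown here. If so, it can help to confirm that every match variable survives the subset. A minimal sketch, where `missing_match_vars()` is a hypothetical helper and the vectors are abbreviated stand-ins for the ones defined above:

```r
# Hypothetical helper: return match variables absent from the kept columns.
missing_match_vars <- function(match_vars, kept_cols) {
  setdiff(match_vars, kept_cols)
}

# Abbreviated stand-ins for all_match_vars and cols_to_keep.
match_vars <- c("FirstName", "LastName", "Sex", "BirthYear",
                "MailingAddressZip5", "RegistrationAddressZip5")
kept_cols  <- c("DT_ID", "FirstName", "LastName", "Sex", "BirthYear",
                "RegistrationAddressZip5", "MailingAddressZip5")

missing <- missing_match_vars(match_vars, kept_cols)
if (length(missing) > 0) {
  stop("Match variables not in kept columns: ", paste(missing, collapse = ", "))
}
```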

***********************************************
# 2) RUN STATE MATCHES
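
Every chunk in this section repeats the same two steps: assign a state abbreviation to `ABB`, then source the template at `fp110`, which is expected to read `ABB` from the environment it is sourced into. A minimal sketch of wrapping that pattern in a function follows. `run_state_match()` is a hypothetical helper, not part of the project's function files, and the demo uses a throwaway template rather than the real one.

```r
# Hypothetical wrapper around the "set ABB, then source the template" pattern.
run_state_match <- function(abb, template_path, envir = globalenv()) {
  stopifnot(is.character(abb), length(abb) == 1, nchar(abb) == 2)
  assign("ABB", abb, envir = envir)      # the template reads ABB
  source(template_path, local = envir)   # evaluate the template in envir
  invisible(abb)
}

# Demo with a stand-in template that only records which state it saw.
demo_template <- tempfile(fileext = ".R")
writeLines("states_seen <- c(states_seen, ABB)", demo_template)

demo_env <- new.env()
assign("states_seen", character(0), envir = demo_env)
for (abb in c("HI", "AK", "AL")) {
  run_state_match(abb, demo_template, envir = demo_env)
}
states_seen <- get("states_seen", envir = demo_env)
```

With the real template the call would look like `run_state_match("HI", fp110)`. Looping over states reduces repetition, at the cost of the per-chunk log separation used in this file.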

## 2.1. HI, AK, AL, AR, DE
```{r}
ABB <- "HI"
source(fp110)
```
STATE: HI
Calculating matches for each variable took 1.55 minutes.
Getting counts for parameter estimation took 1.6 minutes.
Getting the indices of estimated matches took 1.01 minutes.
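
The timing lines pasted after each chunk share one format, so they can be turned into a small table to compare or total run times per state. The sketch below is one way to do that in base R. `parse_timings()` is a hypothetical helper, and it assumes each timing line reads `<step> took <number> minutes`.

```r
# Hypothetical helper: extract step names and durations from pasted log lines.
parse_timings <- function(log_lines) {
  pattern <- "^(.*) took ([0-9.]+) minutes\\.?$"
  hits <- grepl(pattern, log_lines)
  data.frame(
    step    = sub(pattern, "\\1", log_lines[hits]),
    # round() guards against floating-point print noise in the logged values
    minutes = round(as.numeric(sub(pattern, "\\2", log_lines[hits])), 2),
    stringsAsFactors = FALSE
  )
}

hi_log <- c(
  "STATE: HI",
  "Calculating matches for each variable took 1.55 minutes.",
  "Getting counts for parameter estimation took 1.6 minutes.",
  "Getting the indices of estimated matches took 1.01 minutes."
)
hi_timings <- parse_timings(hi_log)
hi_total   <- sum(hi_timings$minutes)  # total minutes across the three steps
```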

```{r}
ABB <- "AK"
source(fp110)

ABB <- "AL"
source(fp110)

ABB <- "AR"
source(fp110)

ABB <- "DE"
source(fp110)
```
STATE: AK
Calculating matches for each variable took 0.77 minutes.
Getting counts for parameter estimation took 0.15 minutes.
Getting the indices of estimated matches took 0.1 minutes.

STATE: AL
Calculating matches for each variable took 6.61 minutes.
Getting counts for parameter estimation took 9.22 minutes.
Getting the indices of estimated matches took 6.26 minutes.

STATE: AR
Calculating matches for each variable took 4.77 minutes.
Getting counts for parameter estimation took 2.41 minutes.
Getting the indices of estimated matches took 1.61 minutes.

STATE: DE
Calculating matches for each variable took 1.94 minutes.
Getting counts for parameter estimation took 1.26 minutes.
Getting the indices of estimated matches took 0.81 minutes.


*****************************************
## 2.2. DC, OR, CO, NY, AZ
```{r}
ABB <- "DC" 
source(fp110)

ABB <- "OR" 
source(fp110)

ABB <- "CO" 
source(fp110)
```
STATE: DC
Calculating matches for each variable took 2.81 minutes.
Getting counts for parameter estimation took 5.6 minutes.
Getting the indices of estimated matches took 3.12 minutes.

STATE: OR
Calculating matches for each variable took 23.74 minutes.
Getting counts for parameter estimation took 8.91 minutes.
Getting the indices of estimated matches took 6.17 minutes.

STATE: CO
Calculating matches for each variable took 34.46 minutes.
Getting counts for parameter estimation took 10.93 minutes.
Getting the indices of estimated matches took 7.57 minutes.


```{r}
ABB <- "NY" 
source(fp110)
```
STATE: NY
Calculating matches for each variable took 658.86 minutes.
Getting counts for parameter estimation took 377.77 minutes.
Getting the indices of estimated matches took 277.32 minutes.


```{r}
ABB <- "AZ" 
source(fp110)
```
STATE: AZ
Calculating matches for each variable took 40.13 minutes.
Getting counts for parameter estimation took 26.08 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 16.81 minutes.
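
Some runs, including AZ above, log that the EM algorithm reached its iteration limit without converging, so their estimates may deserve a second look. One way to collect those states from the pasted logs is sketched below. `flag_nonconverged()` is a hypothetical helper and assumes each state's log begins with a `STATE: XX` line.

```r
# Hypothetical helper: list states whose log contains the non-convergence note.
flag_nonconverged <- function(log_lines) {
  state   <- NA_character_
  flagged <- character(0)
  for (line in log_lines) {
    if (grepl("^STATE: ", line)) {
      # Keep only the two-letter abbreviation after "STATE: "
      state <- substr(sub("^STATE: ", "", line), 1, 2)
    } else if (grepl("has not converged", line, fixed = TRUE)) {
      flagged <- c(flagged, state)
    }
  }
  unique(flagged)
}

demo_log <- c(
  "STATE: DC",
  "Calculating matches for each variable took 2.81 minutes.",
  "Getting counts for parameter estimation took 5.6 minutes.",
  "Getting the indices of estimated matches took 3.12 minutes.",
  "STATE: AZ",
  "Calculating matches for each variable took 40.13 minutes.",
  "Getting counts for parameter estimation took 26.08 minutes.",
  "The EM algorithm has run for the specified number of iterations but has not converged yet.",
  "Getting the indices of estimated matches took 16.81 minutes."
)
flagged <- flag_nonconverged(demo_log)
```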


*****************************************
## 2.3. FL, TX, GA, NC, NJ
```{r}
ABB <- "FL" 
source(fp110)
```
STATE: FL
Calculating matches for each variable took 414.46 minutes.
Getting counts for parameter estimation took 181.11 minutes.
Getting the indices of estimated matches took 127.2 minutes.


```{r}
ABB <- "TX" 
source(fp110)
```
STATE: TX
Calculating matches for each variable took 346.59 minutes.
Getting counts for parameter estimation took 280.01 minutes.
Getting the indices of estimated matches took 190.57 minutes.


```{r}
ABB <- "GA" 
source(fp110)
```
STATE: GA
Calculating matches for each variable took 79.77 minutes.
Getting counts for parameter estimation took 101.11 minutes.
Getting the indices of estimated matches took 58.65 minutes.

```{r}
ABB <- "NC" 
source(fp110)

ABB <- "NJ" 
source(fp110)
```
STATE: NC
Calculating matches for each variable took 84.08 minutes.
Getting counts for parameter estimation took 108.9 minutes.
Getting the indices of estimated matches took 65.15 minutes.

STATE: NJ
Calculating matches for each variable took 114.92 minutes.
Getting counts for parameter estimation took 73.79 minutes.
Getting the indices of estimated matches took 46.53 minutes.


*****************************************
## 2.4. IA, ID, IL, CT, PA
```{r}
ABB <- "IA" 
source(fp110)

ABB <- "ID" 
source(fp110)

ABB <- "IL" 
source(fp110)

ABB <- "CT" 
source(fp110)
```
STATE: IA
Calculating matches for each variable took 4.74 minutes.
Getting counts for parameter estimation took 4.08 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 2.82 minutes.

STATE: ID
Calculating matches for each variable took 0.96 minutes.
Getting counts for parameter estimation took 0.58 minutes.
Getting the indices of estimated matches took 0.42 minutes.

STATE: IL
Calculating matches for each variable took 213.32 minutes.
Getting counts for parameter estimation took 143.25 minutes.
Getting the indices of estimated matches took 95.68 minutes.

STATE: CT
Calculating matches for each variable took 18.81 minutes.
Getting counts for parameter estimation took 16.11 minutes.
Getting the indices of estimated matches took 9.63 minutes.

```{r}
ABB <- "PA" 
source(fp110)
```
STATE: PA
Calculating matches for each variable took 201.8 minutes.
Getting counts for parameter estimation took 134.55 minutes.
Getting the indices of estimated matches took 87.82 minutes.


*****************************************
## 2.5. IN, KS, KY, LA, MA
```{r}
ABB <- "IN" 
source(fp110)

ABB <- "KS" 
source(fp110)

ABB <- "KY" 
source(fp110)

ABB <- "LA" 
source(fp110)

ABB <- "MA" 
source(fp110)
```
STATE: IN
Calculating matches for each variable took 23.88 minutes.
Getting counts for parameter estimation took 25.42 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 15.54 minutes.

STATE: KS
Calculating matches for each variable took 3.66 minutes.
Getting counts for parameter estimation took 3.49 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 2.35 minutes.

STATE: KY
Calculating matches for each variable took 6.72 minutes.
Getting counts for parameter estimation took 7.72 minutes.
Getting the indices of estimated matches took 5.36 minutes.

STATE: LA
Calculating matches for each variable took 8.29 minutes.
Getting counts for parameter estimation took 17.86 minutes.
Getting the indices of estimated matches took 10.49 minutes.

STATE: MA
Calculating matches for each variable took 75.94 minutes.
Getting counts for parameter estimation took 66.8 minutes.
Getting the indices of estimated matches took 42.17 minutes.


*****************************************
## 2.6. MD, ME, MI, MN, MO
```{r}
ABB <- "MD" 
source(fp110)

ABB <- "ME" 
source(fp110)

ABB <- "MI" 
source(fp110)

ABB <- "MN" 
source(fp110)

ABB <- "MO" 
source(fp110)
```
STATE: MD
Calculating matches for each variable took 44.49 minutes.
Getting counts for parameter estimation took 32.93 minutes.
Getting the indices of estimated matches took 19.91 minutes.

STATE: ME
Calculating matches for each variable took 1.59 minutes.
Getting counts for parameter estimation took 1.56 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 1.23 minutes.

STATE: MI
Calculating matches for each variable took 88.51 minutes.
Getting counts for parameter estimation took 70.06 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 56.89 minutes.

STATE: MN --- SOMETHING WENT WRONG???
Calculating matches for each variable took 18.4 minutes.
Getting counts for parameter estimation took 18.63 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 11.56 minutes.

STATE: MO
Calculating matches for each variable took 22.56 minutes.
Getting counts for parameter estimation took 21.95 minutes.
Getting the indices of estimated matches took 14.05 minutes.


*****************************************
## 2.7. MS, MT, ND, NE, NH
```{r}
ABB <- "MS"
source(fp110)

ABB <- "MT" 
source(fp110)
```
STATE: MS
Calculating matches for each variable took 3.16 minutes.
Getting counts for parameter estimation took 7.64 minutes.
Getting the indices of estimated matches took 4.6 minutes.

STATE: MT
Calculating matches for each variable took 0.86 minutes.
Getting counts for parameter estimation took 0.26 minutes.
Getting the indices of estimated matches took 0.19 minutes.


```{r}
ABB <- "ND" 
source(fp110)

ABB <- "NE"  
source(fp110)

ABB <- "NH"  
source(fp110)
```
STATE: ND
Calculating matches for each variable took 0.66 minutes.
Getting counts for parameter estimation took 0.11 minutes.
Getting the indices of estimated matches took 0.09 minutes.

STATE: NE
Calculating matches for each variable took 1.57 minutes.
Getting counts for parameter estimation took 1.01 minutes.
Getting the indices of estimated matches took 0.71 minutes.

STATE: NH
Calculating matches for each variable took 2.04 minutes.
Getting counts for parameter estimation took 2.26 minutes.
Getting the indices of estimated matches took 1.45 minutes.


*****************************************
## 2.8. NM, NV, OK, RI, SC
```{r}
ABB <- "NM" 
source(fp110)

ABB <- "NV" 
source(fp110)

ABB <- "OK" #est 30 min
source(fp110)

ABB <- "RI" #est 5 min
source(fp110)

ABB <- "SC" #est 40 min
source(fp110)
```
STATE: NM
Calculating matches for each variable took 2.18 minutes.
Getting counts for parameter estimation took 1 minutes.
Getting the indices of estimated matches took 0.75 minutes.

STATE: NV
Calculating matches for each variable took 5.01 minutes.
Getting counts for parameter estimation took 2.2 minutes.
Getting the indices of estimated matches took 1.53 minutes.

STATE: OK
Calculating matches for each variable took 4.59 minutes.
Getting counts for parameter estimation took 5.12 minutes.
Getting the indices of estimated matches took 3.09 minutes.

STATE: RI
Calculating matches for each variable took 1.69 minutes.
Getting counts for parameter estimation took 1.55 minutes.
Getting the indices of estimated matches took 1.04 minutes.

STATE: SC
Calculating matches for each variable took 9.9 minutes.
Getting counts for parameter estimation took 9.54 minutes.
Getting the indices of estimated matches took 6.38 minutes.


*****************************************
## 2.9. SD, TN, UT, VA, VT
```{r}
ABB <- "SD" 
source(fp110)

ABB <- "TN" 
source(fp110)

ABB <- "UT" 
source(fp110)

ABB <- "VA" 
source(fp110)

ABB <- "VT" #no bday or bmonth 
source(fp110)
```
STATE: SD
Calculating matches for each variable took 0.96 minutes.
Getting counts for parameter estimation took 0.46 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 0.33 minutes.

STATE: TN
Calculating matches for each variable took 28.96 minutes.
Getting counts for parameter estimation took 28.55 minutes.
Getting the indices of estimated matches took 17.28 minutes.

STATE: UT
Calculating matches for each variable took 2.6 minutes.
Getting counts for parameter estimation took 3.3 minutes.
Getting the indices of estimated matches took 2.18 minutes.

STATE: VA
Calculating matches for each variable took 68.58 minutes.
Getting counts for parameter estimation took 55.1 minutes.
Getting the indices of estimated matches took 34.54 minutes.

STATE: VT
Calculating matches for each variable took 0.72 minutes.
Getting counts for parameter estimation took 0.3 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 0.21 minutes.


*****************************************
## 2.10. WA, WI, WV, WY, OH
```{r}
ABB <- "WA" 
source(fp110)

ABB <- "WI" 
source(fp110)

ABB <- "WV" #est 1 min
source(fp110)

ABB <- "WY" 
source(fp110)

ABB <- "OH" 
source(fp110)
```
STATE: WA
Calculating matches for each variable took 30.38 minutes.
Getting counts for parameter estimation took 24.94 minutes.
Getting the indices of estimated matches took 16.23 minutes.

STATE: WI
Calculating matches for each variable took 43.18 minutes.
Getting counts for parameter estimation took 31.64 minutes.
Getting the indices of estimated matches took 19.99 minutes.

STATE: WV
Calculating matches for each variable took 1.12 minutes.
Getting counts for parameter estimation took 0.64 minutes.
Getting the indices of estimated matches took 0.48 minutes.

STATE: WY
Calculating matches for each variable took 0.56 minutes.
Getting counts for parameter estimation took 0.08 minutes.
The EM algorithm has run for the specified number of iterations but has not converged yet.
Getting the indices of estimated matches took 0.06 minutes.

STATE: OH
Calculating matches for each variable took 107.46 minutes.
Getting counts for parameter estimation took 87.39 minutes.
Getting the indices of estimated matches took 53.54 minutes.


*****************************************
## 2.11. CA
```{r}
ABB <- "CA" #est 2 days
source(fp110)
```
STATE: CA
Calculating matches for each variable took 1136.23 minutes.
Getting counts for parameter estimation took 681.6 minutes.
Getting the indices of estimated matches took 502.58 minutes.

